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Abstract 

' Cellular phones are now offering an ubiquitous means for scientists to observe life: how people act, move 

and respond to external influences. They can be utilized as measurement devices of individual persons 
and for groups of people of the social context and the related interactions. The picture of human life that 
emerges shows complexity, which is manifested in such data in properties of the spatiotcmporal tracks 
of individuals. We extract from smartphone-based data for a set of persons important locations such as 
i-C . "home" , "work" and so forth over fixed length time-slots covering the days in the data-set (see also PJ2]). 
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This set of typical places is heavy-tailed, a power-law distribution with an exponent close to -1.7. To 
analyze the regularities and stochastic features present, the days are classified for each person into regular, 
personal patterns. To this are superimposed fluctuations for each day. This randomness is measured by 
"life" entropy computed both before and after finding the clustering so as to subtract the contribution 
of a number of patterns. The main issue, that we then address, is how predictable individuals are in 
their mobility. The patterns and entropy arc reflected in the predictability of the mobility of the life both 
individually and on average. We explore the simple approaches to guess the location from the typical 
behavior, and of exploiting the transition probabilities with time from location or activity A to B. The 
patterns allow an enhanced predictability at least up to a few hours into the future from the current 
location. Such fixed habits are most clearly visible in the working-day length. 

Introduction 

The digitalizcd world is now getting entangled with the daily life of almost everyone. This has various 
consequences from the trivial ones - "being always connected" - to less obvious ones. But, for the scientist 
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this gives a chance to study human life in a quantitative fashion. Aspects that become available include 
the mobility (person X goes from A to B) statistics and the way a person reacts via a phone to some 
external action [3]-[5] . Quantitative sociology then becomes boosted and enhanced by the possibilities the 
phone-based data offers, and allows to study "motifs" and regularities in life [6HE|. A very important 
dimension brought by the omnipresent phone to the modern life is the social context and interactions it 
brings about. [9~Ml4j. 

Human life can obviously be seen to be both random and regular, and its inherent complexity is 
manifested for instance in spatiotcmporal tracks of individuals - "mobility". We are in the following work 
concerned with three fundamental questions: how to describe the regularity of the patterns in the life of 
a person from day to day, what kind of variations there are around that, and how predictable are the 
features of mobility then after all? To this end, we exploit a novel data-set of smartphone users (Methods) 
which allows us to extract the locations the users are associated with and pick the most important ones. 
The raw data is first pruned and filtered. The goal is then to extract the significant locations for fixed 
length time-slots - "where was I from 8 to 9 in the morning?" [T5H213] . The data is thus distilled into 
a discrete format: at time-slot T the user X is in location (or perhaps state) A forming thus triplets 
(X,T,A). 

These places or locations contain the most obvious ones such as "home" , "work" and so forth, but 
it turns out that for both a typical person and the whole set the location statistics follow a fat-tailed 
distribution. The data in the form (X,T,A) can be with a clustering-analysis turned into sets of days, 
that is typical patterns ("working day" etc.): in a certain days the activity of X at T is A, while in others 
it is B for instance. A day at work does not completely follow the pattern of that particular person, so 
noise is superimposed on the pattern. A physicist would say that a person has " ground-states" of typical 
behaviors which are mixed with fluctuations. 

A physics or information theory based measure for the fluctuations is the entropy, so that an individual 
person can be characterized by the number of patterns he/she can be "classified with", and the personal 
life entropy. The entropy is useful to compare when computed before and after finding the patterns by 
clustering [2TJ[32] . We discover that much of the raw entropy comes from the fact that some individuals 
have quite a few regular patterns of life rather than from real randomness. 

The next natural question is whether one can predict the location (or activity by indirect logic) 
of a person based on the current one. The simple probabilistic guess "I know what this person does 
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typically at this time" works fairly well for some, but in general does not do well. This can be improved 
somewhat by considering separately all the sets of days (ground-states, again). However, the empirical 
transfer matrices (related to pairs of the triplets (X,T,A) and (X,T+1,B) from above) include more of 
the information in the same patterns of life: the predictability is much higher utilizing those and the 
further pattern-induced improvement is quite mild. The capability to predict mobility fairly well in fact 
extends quite a few hours into the future. This is somewhat surprising given that the predictability is for 
each person dependent on the personal entropy. 

We also analyze as the most significant pattern the working days in the data-set, clustering the days 
to find such patterns (Methods) . This is done from the same starting point of predictability, but with the 
idea of trying to gain understanding on the origins of the daily random deviations. In concrete terms, we 
find that for a majority of the users in the set the working day has a typical, personal length and a simple 
model suggests, that the daily changes to this can be understood as a combination of a psychological 
trend to stay longer or leave and the daily random reasons to shorten/prolong the stay at work. This 
means, that there is a strong degree of predictability again, and that it is influenced both by personal 
(or by a particular job- dependent) details and personal stochastic variations. 

Results 

Days of phone users and the entropy 

The histogram of the relative time fractions that persons spend per location is shown in Fig. [1] The 
distribution appears to be wide, a power law with an exponent of about 1.7 (1.69 ±0.05); note that 
similar scale-free statistics surface in studies of human mobility [31l23| though the two quantities are not 
the same. For each user in the further analysis below a threshold is used such that locations less frequent 
than 1 % of the most common one are lumped together. Two individual users are demonstrated in the 
Figure Q] including the respective thresholds (7 and 22 significant locations). Note for instance how the 
statistics for user 2 are close to the averaged behavior. It is interesting that such regularities are found for 
individuals. The fraction of locations where some time is spent, that is how many locations are relevant, 
depends on the time during the day or the slot T (Supplementary Information) |24j . 

Two example users, say Mr. Joe Random and Ms. Jane Regular, are depicted in Figure O The 
dominant patterns can be clearly identified as candidates for a typical "working day" and a typical 
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"home day". Two main features are the randomness or deviations present - consider a typical work day, 
where the noise is at both ends of the period at the work location - and the similarity of the clustering 
results. These two choices from extreme ends of the spectrum of the set of users illustrate the variety 
of behaviors. We also compare the results from two different clustering methods (EO and k-means, see 
Methods). The simple question here is how much the results from the analysis depend on the method 
applied. 

Next, we compute a personal entropy over the time slots by 

Ej=i Eiei l ij/D tot log(lij/D to t) 
£ = N (1) 

where N is the number of time slots in one day (columns), D tot is the total number of days (rows), / 
is the set of all visited locations and hj is the number of times location i dominates time slot j. A low 
entropy means the user is regular and vice versa. However, part of the randomness is only apparent due 
to the patterns. The extreme limit is when the split perfectly among the different patterns (e.g. one is 
only at work in a certain slot in one, and elsewhere in the others). Thus one should consider a clustered 
entropy that gives in that limit the correct result, zero. Such one is the weighted average of the single 
cluster entropies: 

Ecec E^Li Eie/ l vc logfajc) 

s c = (2) 

D tot N K ' 

where D c is the number of days that belong to cluster c and Z, JC is the number of times location i is the 
most dominant location at time slot j in cluster c, Pij C = Zy c / D c and C is the set of clusters. The drop 
in entropy due to the clustering is then defined as 

Ae [%] = 100 (1 - ^). (3) 

e 

This gives also a measure for the "goodness" of clusterings (the higher Ae[%] the better). When consid- 
ering entropies one should recall that the clustering has been done a priori, without using entropy change 
to guide it. One possibility would be to look for the patterns that minimize the entropy, in other words. 
For the fc-means case, the example which minimizes the distances to the cluster centroids is picked, which 
is in practice close to taking the smallest entropy value over the trials. 

Joe and Jane exhibit different entropies and relevant patterns. Joe has a bare entropy of 2.70 decreas- 



5 



ing to 1.25 and 0.78 after EO and fc-means clusterings, respectively (54 and 71 % changes). Joe's life has 
16 typical patterns. For Jane (3 patterns) the entropy is 0.44, and after clustering 0.21 (53 % decrease). 
The results over the user set are summarized in Fig. ordered according to the entropy drop. A picture 
emerges of large entropy variations from user to user, and a trend in the entropy with the numbers of 
significant locations and patterns. The latter two fluctuate from person to person, but not with any kind 
of power-law statistics as that seen in Fig.[T] There are thus two contradicting observations: the entropies 
vary quite some, while the location statistics present regularities. The cumulative entropy histograms 
are presented in Figure [3}d. It appears that they arc quite close to log-normal distributions, though the 
tail properties can not be resolved convincingly with the histograms at hand. The question of where do 
the "rare" locations that exist in the location data end up or what they signify has an easy answer: a 
direct comparison of the noise superimposed on the patterns reveals that as one could expect the more 
rare locations are found (more frequently) in the noise (ie. when the actual location (X,T,A) does not 
agree with the pattern-indicated location (X,T,B), Supplementary Figure 26). 

We used a null-model, included in the Figure [3Jd, to compare the entropies: assume the pattern(s) 
can be measured by a concentrated probability p c for a pattern (for the descriptive location), plus the 
residual variations around it in N — 1 locations, where the N is found for each user separately as above. 
This means, that the 1 — p c part is assumed to correspond to a completely random case. This would 
mean that the patterns (descriptive location) are just mixed with noise. Then, the maximal entropy can 
be written as e c = e(j> c ) + (1 — p c ) \n(N — 1). We find that the distributions are ranked so that the one 
for the patterned cases has the smallest entropies. This means most likely that additional information is 
hidden in the noise. The entropies for the user set do not extend in this case to very high values. The 
median entropies are 1.16 and 0.56, respectively. Such values have the simple interpretation that there 
are around three relevant locations for any slot, which decreases to less than two for a typical pattern. 

Predictability of locations 

To which degree one can predict a single user's behavior from the current state, ie. location A at time T? 
If the entropy with patterns would be zero, this could be easy: just know the "actual pattern" to get a 
perfect accuracy. Such regularities or high predictability are analogous to what is indicated in some cases 
by common sense ("at sleep during night", Supplementary Figure S27). We consider next predictions 
similar to weather forecasting: first, using the "average" behavior as in what is the expectation for a slot, 
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and second, by asking what is the probability ps(t + At) of the location B at t + At if the user was at A at 
t? The second method amounts to using the measured transition matrices and checking what they imply 
for the future from the current location, comparable to the recent Markov chain idea |25j . Since in both 
cases we can also restrict the days to one pattern at a time and average then over the patterns, we have 
in effect four different methods. The recent activities on "next location prediction" and similar topics 
quote a rather large number of related methods; our results below are better than any of those known to 
us from the literature [26H32] though Gambs et al. [25] reach comparable levels using GPS-based data. 
Etter et al. [32) for instance note also the possible roles of data sparsity, noise, and personal variations. 
Of these we analyze the two last ones below in more detail. 

The first prediction quality is TlA(t), user will be at time t. II t averaged over the whole day is the 
total prediction quality. With pij C the probability to be at i at time slot j in pattern c, then, quite simply 
n t = pfj c , with the average over all i, c. This method amounts to utilizing the measured probabilities 
to get a a guess where a user would be - consider a two-state model where these locations would have 
probabilities p\ and P2 = 1 — Pi , in which case such a guess works with the quality p\ + p\ . The results 
are shown in Fig. @Jt. The prediction accuracy improves with patterns to 0.53 ± 0.12 from 0.36 ± 0.11 
(without). 

With the second method, considering the equivalent case by setting At = 1 the accuracy is much 
higher. The average value changes adding the patterns from 0.74 ± 0.07 to 0.78 ± 0.06 (Fig. HJd), the 
best result of the four different methods tried. The small difference or change from the patterns implies 
that they are already quite well encoded in the original transition probabilities. This is natural since here 
we consider the transitions (or mobility), not the static probabilities. Thus in the case of the previous 
example one could find that the two states always change to the other at the next timcstcp: this has a 
quality of one. The maximum mean predictability is suggested by the method and data to be close to 0.8 
even with further refinement in finding patterns, due to this small improvement. The upper limit of the 
last case, with patterns, as observed from the distribution of personal predictability values found here, 
is quite close to the maximum limit postulated for the predictability of mobility patterns, 0.93 [1]. This 
might not be surprising, that the predicability of daily routine is constrained by the randomness-induced 
limitations in mobility prediction. 

The patterns imply that the transitions contain useful information, which then brings us to look at 
long-range predictability, At > 1. The quality decays logarithmically with At, Fig. 0J), or possibly as 
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a power law with a small exponent (Supplementary Figure S28). We have not considered longer-range 
predictions than those depicted here mostly due to data limitations. Surely this question is interesting 
and has non-trivial features - the sleep-wake cycle, the weekly cycle of work and leisure among others. 
The slow decay observed or measured implies that life in its regularity is - again given that there are 
such patterns - actually relatively quite predictable. In particular, note that the quality remains better 
than the average of the location method for At = 1, even up to time-delays of several hours. Below, we 
discuss the regularities of working days, where a high level of predictability is often quite given a priori. 

Predictability has large personal variations (Fig. |4|:) due to entropy if one tries predicting without 
exploiting the patterns. The range in which the first type prediction quality varies shows a substantial 
variation from "easy to predict" to "difficult to predict" , which correlates as stated with the entropy 
(Supplementary Figures S29-S30). Joe's life would be appear to be quite impossible to capture without 
patterns, but including those increases the predictability substantially "upto" the level in the case of 
Jane. In other words, the implication would be that we are slaves of our daily-life habits, but those can 
be quite plentiful and different and still exhibit typical deviations. 

Departing from work 

The regularity in the data includes and implies long-range daily correlations: the behavior depends on 
the past. The easiest example of this is the working day length, which we consider next. Figure [5] depicts 
the distribution for the probability to leave from work (to any other location) , for users that have regular 
working patterns and enough data, slightly more than a half of the total (Supplementary Figure S31). 
The data has been first scaled by using the mean and the standard deviation for all such users so that 
one is able to consider the sum distribution. 

The tails of the likelihood are different on both sides of the maximum. The Figure [3] includes a fit 
of a model (see also Supplementary Figure S32). First we assume that the time spent at work is shorter 
in a particular day when one leaves for a specific reason (visiting a doctor, meeting friends/family...). 
The action of leaving from work has a likelihood per time step, which increases by a constant factor by 
each step until the day has reached its typical length. This assumption means that it becomes more easy 
to quit working if the resulting deviation from the usual is shorter. Sometimes on the other hand work 
requires extra attention. In the model, we assume simply that if the day lasts beyond the typical length 
the probability per time step to leave increases with another parameter. This means that it becomes 
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more difficult to stay for each further step, a trend which is exactly the same as assumed for the other 
case of leaving early. 

These assumptions boil down to two exponentially increasing trends to leave early or not to stay any 
longer, which can be used to fit a distribution to the data. For the data at hand, the model works fairly. 
The central assumption behind the model is that the average working day length has for each user a 
deterministic influence on the decisions to change once from the usual habit, perhaps a triviality in many 
professions. This conditions the patterns over a large time span, as seen earlier in the context of long-range 
mobility prediction (compare with. Fig. |4}d). Wc also tried to correlate the departure from work with the 
personal communication records. This amounts to looking at call and text message patterns and partners 
(numbers for in- and out-coming communications considered also in the light of their frequency). The 
idea would be roughly: " I go since I made a call or got a call" , possibly also conditioned on the quality 
of the communication: cither a rare or frequent contact as deduced from the randomized phone number. 
However, such attempts were left inconclusive. One reason is that the persons considered in this study 
on average communicate (calls, text messages) fairly rarely, so any possible causality can not be derived 
from data. The lack of correlation in any case implies a relatively minor role of such communications in 
enforcing a particular departure from routine. 

Conclusions 

To summarize, smartphone -based studies allow for quantitative insights into the fundamentals of human 
life. We have here analyzed such data with the approach of first extracting patterns and a measure of 
randomness, and then analyzing the predictability of individual mobility. This utilizes the coverage in 
space and time available from the data-set. All in all, this means that such spatiotcmporal patterns, 
together with the personal randomness, define quantitatively individuals and populations. Patterns and 
entropy relate to the degree the daily locations or activities are predictable. We have demonstrated a 
fairly good quality (0.78 in the best of the four methods tried). 

The data-set we have at our disposal is just sufficient for the present purposes. The user panel, with 
its particular socio-economic and geographic background (for a comparison, consider Ref. |33j ) and its 
size may introduce various bias, which however pales in importance if put into contrast with a few bigger 
issues in the context of human life analysis. We already find noticeable variations in the simplest personal 
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quantitative characteristics: the number of locations to describe the personal life, the number of patterns, 
and the entropy, in the case at hand. It appears that the consequent realistic predictability arises from 
the presence of both patterns and noise in the life. Quite good predictability - the best such to our 
understanding so far - is found from the combination of using mobility and the discovered life patterns. 

One follow-up question is: how do entropy and predictability change from culture to another, including 
on-line games and so forth [34], or during longer spans of an individual's life. We are here looking at 
both of these quantitative characteristics with the somewhat contradictory assumption that the persons 
described by the data are "in an equilibrium". Thus, another important question is: how does the 
eventual lack of stationarity matter here, and how should it look like in the data? Life is not periodic, as 
it might be in weekly or monthly patterns over a finite time span, and moreover a society or culture is 
in a state of permanent flux. What do the inevitable trends and transients look like, for a person? Are 
there collective phenomena [35| quantified by the entropy, patterns, and predictability computed over 
long but finite windows in time - as during the times of a crisis in a society? Such questions will need to 
be answered by access to much larger data-sets. 

Methods 

The location data consists of the raw cell ID timestamps and the coordinates of the corresponding base- 
stations from a smartphonc-based application pjl [3"BT - |3"5] . The mapping of cell ID is difficult for issues 
inherent to the mobile phone technology |39j . for which reason we used two methods (Text SI, Supple- 
mentary Figures Sl-3): offline clustering [TS] and a fingerprint based method. This kind of comparisons 
are useful since such approaches can not be perfect. The original data set has more than 500 users, 
from a commercially conducted "user panel" in southern Finland during 2008-10, before processing. The 
participants ("panel members") gave written consent ("user agreement") also concerning the gathering 
of the data and its use for various purposes in the anonymized format, including scientific studies. The 
location data is further coarse-grained into time-slots of uniform length resulting in the triplets (X,T,A) 
from above (see also [30,41 ). Here, we use the simplest possible approach where the location A for the 
slot is chosen by a majority rule from the detailed location information during the slot length (see in 
general the Supplementary Figures S4-S11). 

The data contains gaps, typically since the phone is closed down and so is then the data gathering 
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application (Supplementary Figures S12-S14). The time-slots missing arc partly covered by a padding 
technique (user X closes the phone at home from 01.00 to 05.00 being a typical case). Such gaps seem in 
general to contain some information on the user behavior. Finally, days with more than 30 % of missing 
data after this procedure were entirely excluded, reducing the missing data from initially almost 50 % to 
less than 1 %. The slot length has been varied from 30 minutes to 2 hours, out of which we show in this 
work representative results for the 1 hour case (recall the daily 24 hour cycle). The original data-set of 
several hundreds of users is at the end pruned upto 66 persons, for each of which at least 30 full days 
of data suitable for analysis were found. In addition to the location data, we also have access to a call 
and message data set for the same persons - except for the "padded" time slots of course. This allows in 
principle to study the correlations of user activity with both the communications patterns and individual 
communication events such as text messages and calls, including the anynomized phone numbers. 

The patterns are resolved by a cluster-analysis (Supplementary Figures S15-S26). The analysis is 
done to establish the typical days or a set of characteristic behaviors. Note that we assume directly that 
the human behavior here can be classified in this way : that there is not simply a continuous spectrum of 
eigenbehaviors in contrast to a discrete set [5J , and that it is reasonable to consider this on the particular 
level of time-discretization (1 hour slot length). The easiest division here might be "work-days" and 
" weekend-days - that separate to subclasses as " Saturday" and " Sunday" - but a priori this is not given. 

We use two methods to compare the influence of the technique; naturally the result depends e.g. 
on the similarity metric used for a pair of day- vectors. The first is fc-means clustering and Euclidean 
metrics on a binary expansion (see Ref. [B]) of the data-set. The second method is community detection 
on weighted graphs, where the days constitute the vertices and the weights depend on the weighted 
Hamming distance dh between any two as 

Wij = cxp(-f3d h (i, j)), (4) 

where /3 is a parameter. We apply an extremal optimization method (EO) [42], so that we can find the 
number of clusters k as a self-organized outcome. We compute a single clustering, while for the fc-means 
we do 200 runs with random initial conditions. 
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Supporting Information Legends 

Supporting Information SI contains information and further details on the data handling and processing, 
and results (6 MBytes). 



Figure Legends 




Figure 1. Important places in life: Average distribution (relative frequency) of locations in the 
timcslots. The data is averaged over all the different users in the set. A power-law fit to the statistics is 
also indicated, with an exponent -1.69. We also show two different random users with the "personal" 
statistics plus the cut-offs (see text) for the analysis of patterns. 
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Figure 2. Patterns of life: Joe Random and Jane Regular: Upper panel: Clustering result for 
the dayvectors of user Joe. Lower panel: Clustering result for the dayvectors of user Jane. In both 
panels, in the first row are the results obtained with the extremal optimization algorithm and in the 
second row the corresponding result with the fc-mcans clustering. Each color is one location — the 
clusters are separated by white lines. The first column shows the unclustercd days. The second and 
third column show the clustered days and the corresponding average days of each cluster. The last 
column shows the deviations from the average day, where the color of the location is only shown if it 
deviates from the location of the average day of the cluster — otherwise the color is black. 
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Figure 3. Statistics of patterns, and life entropy: a) The clustering results for 58 accepted users 
with the first place definition (off-line clustering) and a time slot of 1 hour. The clustering was done 
with the EO algorithm and ft = 10/24. b) The cumulative distributions of the entropies for the three 
cases - bare entropy, reference model, and after clustering. 
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Figure 4. Predicting life: a) The prediction accuracy of the first, location based method in the case 
of with clustering and without as a function of time during the day. The clustering is from the EO 
algorithm with (3 = 7. b) The predictions by the transfer matrix method for various /3 and AT up to 12 
hours. The dashed line is a logarithmic fit to the /3 = 7 case, c) The correlation of the personal entropy 
and the resulting predictability for both the prediction methods (/? = 7). The entropy values are either 
the "bare" ones or those after clustering if that is used to aid prediction. 
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Figure 5. Predicting leaving work: Length-of-working-day distribution in scaled units and the 
model fit. The parameters for the leaving early and late -processes are 0.045 and 0.046, respectively (in 
inverse c-slot lengths). 



